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Epidemiological models have highlighted the importance of population struc- 
ture in the transmission dynamics of infectious diseases. Using HTV-1 as an 
example of a model evolutionary system, we consider how population structure 
affects the shape and the structure of a viral phylogeny in the absence of strong 
selection at the population level. For structured populations, the number of 
lineages as a function of time is insufficient to describe the shape of the phylo- 
geny. We develop deterministic approximations for the dynamics of tips of the 
phylogeny over evolutionary time, the number of 'cherries', tips that share a 
direct common ancestor, and Sackin's index, a commonly used measure of phy- 
logenetic imbalance or asymmetry. We employ cherries both as a measure of 
asymmetry of the tree as well as a measure of the association between sequences 
from different groups. We consider heterogeneity in infectiousness associated 
with different stages of HIV infection, and in contact rates between groups of 
individuals. In the absence of selection, we find that population structure may 
have relatively little impact on the overall asymmetry of a tree, especially when 
only a small fraction of infected individuals is sampled, but may have marked 
effects on how sequences from different subpopulations cluster and co-cluster. 



1. Introduction 

Viruses, especially RNA viruses such as human immunodeficiency virus type 1 
(HIV-1), hepatitis C virus and influenza A virus, may exhibit a great deal of gen- 
etic variation at the population level, allowing the reconstruction of viral 
phylogenies that reflect the past transmission of the virus. The shape of the phy- 
logeny can tell us a great deal about how population processes, such as changes 
in population size and geographical population structure, and immunological 
processes, such as selection on the virus to escape immune responses, interact [1]. 
For example, 'star-like 7 phylogenies are typical of populations that are growing 
exponentially, while 'ladder-like' phylogenies are consistent with a model 
where one variant is replaced by another due to immune escape. This integration 
of ecological, epidemiological and evolutionary processes has been dubbed 'phy- 
lodynamics' [2]. Phylodynamic approaches have been used in hundreds of 
studies of viruses [3] and have generated important insights into the transmission 
dynamics of many viral pathogens, such as the spread of HIV in the UK [4,5], as 
well as the geographical spread of influenza A [6-8]. The information obtained 
by applying phylodynamic models to viral sequence data would be hard, if not 
impossible, to obtain through more classical epidemiological approaches. 

The majority of phylodynamic studies have employed models derived from 
simple population dynamic models of single species. However, these models 
may be inappropriate when considering the spread of a virus in a population. 
A key quantity in these models is the coalescence rate, the rate at which lineages 
coalesce in a phylogeny as we go backwards in time from the present. From the 
coalescence rate, these models generate estimates of effective population size or 
N e , which is commonly (mis)interpreted as being proportional to the number 
of infected individuals. Previously, we have demonstrated that the coalescence 
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rate of an infectious disease is related not only to the preva- 
lence, but also to the rate of transmission (i.e. the incidence) 
[9]. Consequently, the conclusions of previous studies, par- 
ticularly those that integrate viral sequence data with 
information on prevalence, may have to be reinterpreted. 
The use of epidemiological models to underpin viral evol- 
utionary models can lead to results that are more easily 
interpretable, and permit the inclusion of prior information, 
such as that on the duration of the infectious period, as 
well as facilitating the integration of phylogenetic data with 
other forms of epidemiological data [10]. 

Recently, there has been increased interest in considering 
the phylodynamics of structured populations, for example, in 
the context of studying the spatial spread of viruses from 
sequence data ('phylogeography') [11]. In addition, there 
are many other forms of heterogeneity that may be impor- 
tant, including differences by age, duration of infection, 
contact rate, infectiousness, susceptibility, treatment or vacci- 
nation status, etc., depending on the system being studied. 
When data are available on which subpopulation a viral 
sequence is associated with, there are a variety of tests that 
can be used to assess whether there is significant population 
structure (see Zarate et at. [12] for a comparison of several 
approaches to within-host HIV population structure). A par- 
ticular challenge arises when data on the subpopulations are 
lacking. For example, while acute HIV infection is associated 
with higher infectiousness, information on the time since 
infection may not be available; similarly, while there may 
be differences in contact rates between different subpopu- 
lations, many molecular epidemiological studies of HIV do 
not collect behavioural data. Recently, Leventhal et al. [13] 
took an inventive approach to this problem; using a phylo- 
geny from the Swiss HIV epidemic, they showed that the 
phylogeny was significantly more unbalanced than expected 
from a simple model of random mixing, which they argued 
could be due to contact structure in the at-risk population. 
They used a measure of tree balance, Sackin's index [14], 
and derived an approximation of the expectation of Sackin's 
index given a transmission network. Of note, they did not 
present a similar approximation for Sackin's index given 
the underlying contact network. 

In this study, we consider how population structure may 
affect phylodynamic patterns, using HIV-1 as a model 
system. We first introduce the notion of tree imbalance or 
asymmetry. We then review our framework for modelling 
coalescence using ordinary differential equations, presenting 
another perspective on our past results, which we extend to 
consider the dynamics of external branches (or tips or 
leaves) of the phylogenetic tree. This device allows us to 
model cherries [15], pairs of tips that share a direct 
common ancestor, which we use to capture both tree asym- 
metry as well as population structure. We also derive an 
approximation to Sackin's index as a complementary 
measure of tree asymmetry. We apply this approach to deter- 
mine how (i) higher infectiousness during acute infection and 
(ii) the presence of a high-risk group with a high contact rate 
may affect the shape and the structure of the viral phylogeny. 



2. Asymmetry of phylogenetic trees 

The most widely used model used in studies of viral phylo- 
dynamics is the time-varying coalescent model [16], which 



considers the genealogical process in a population that 
changes size in a deterministic fashion according to some 
relative size function, v(r), where r is time measured in gen- 
erations, starting with the present and going backwards. 
Assuming a sample of n individuals taken at time r = 0, 
and that the sample can be traced back to a single common 
ancestor with probability one, the dynamics of the number 
of distinct ancestors of the sample at time r is modelled as 
a stochastic process {A n (r), r > 0}, which starts at A n (0) = n, 
and moves down in steps of 1 until reaching 1, at which 
point the sample has been traced back to the common ances- 
tor. In a small time-step h, the transition probabilities are 
determined by the following: 
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This model assumes that the rate of coalescence between 
any two lineages is the same for all pairs of lineages, but 
varies over time. If the rate of coalescences varies between 
lineages at a given time, then this may have an impact on 
the shape of the tree [17]. Hence, deviations of the shape of 
an inferred tree from that expected under the coalescent 
model suggests that additional biological complexity may 
need to be considered. There are a number of different measures 
of tree shape that can be used for this purpose [18,19], but we 
focus on two specific measures: the number of 'cherries' [15] 
and Sackin's index [14]. Figure 1 illustrates how these stat- 
istics are calculated for two small trees, one symmetric and 
one asymmetric. Measures of tree shape tend to consider 
another stochastic process that generates trees, a linear birth 
or the Yule process [20]; however, as this model gives the 
same probability distribution on cladograms (i.e. the top- 
ology of the tree) [21], results on asymmetry for the Yule 
process also hold for the coalescent model. 

Cherries are defined as the number of tips that share a 
direct ancestor, which are generated when two tips coalesce. 
The expected number of cherries in a tree with n taxa under a 
Yule or coalescent model is n/3 [15]. In an asymmetric tree, 
tips tend to coalesce with branches deeper in the tree, and 
there are fewer cherries than expected. We denote the 
number of cherries as C, which is not to be confused with 
another measure of asymmetry, Colless' index [22]. 

Sackin's index is a measure of the topological distance 
from the tips of the tree to the root and is defined as follows. 
If the distance dj of a leaf; is the number of internal nodes that 
need to be traversed when following the path from the root of 
the tree to a leaf j, then Sackin's index is the sum of all such 
paths, Is = J2j dj. The expectation of Sackin's index for n taxa, 
E(J s (n)), under a Yule or coalescent process [23] is as follows: 
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where is the polygamma function of order 0, and y e the 
Euler-Mascheroni constant 0.577). For large n, E(J s (w)) w 
In log(n). As Sackin's index increases with sample size, it is 
often standardized by dividing by the number of sequences. 
Although this has a direct biological interpretation — the mean 
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cherries = (1,2), (5,6) 
Sackin's index = 3 + 3 + 2 + 2 + 3 + 3 = 16 



cherries = (5,6) 
Sackin's index = 1 +2 + 3 + 4 + 5 + 5 = 20 



Figure 1. Schematic illustrating cherries and the calculation of Sackin's statistic for a symmetric (a) and an asymmetric (b) six-taxon cladogram. 



root-to-tip distance (in terms of nodes) — we employ a different 
standardization used by Leventhal et ah [13], which is as follows: 

k(n) - E(J s (n)) 



(2.4) 



Under the null Yule or coalescent model, Is (n) — 0, which 
allows one to assess deviations from the null model more 
easily. We calculated these tests for two HIV phylogenies, 
one from an early clinical trial, ACTG 241 [9,24], and another 
of group M viral sequences sampled from HIV-infected indi- 
viduals from the Democratic Republic of Congo [25-27]. 
Both trees show moderate, but statistically significant, evi- 
dence of asymmetry (see the electronic supplementary 
material, figure SI). For both of these datasets, the sequences 
were sampled at approximately the same time. Although 
many viral datasets are collected in serial samples, and this 
can result in more asymmetric trees than if sequences are 
sampled at a single timepoint (see the electronic supplemen- 
tary material, figure S2), in order to keep the exposition 
simple, we will consider sampling at a single timepoint, 
although the approach taken here can also be extended to 
serial samples. 

The number of cherries and Sackin's index complement 
each other well, as the number of cherries captures asymme- 
try in the recent evolutionary past, while Sackin's index 
captures asymmetry over the entire evolutionary history of 
the sample, and simulations demonstrate that these statistics 
are only weakly correlated under the coalescent model (see 
the electronic supplementary material, figure S3). 



3. Tree shape in a simple model of HIV infection 

To investigate how the shape of a viral phylogeny is linked to 
transmission, we first considered a simple model commonly 
used to study the spread of HIV among men who have sex 
with men (for a comparison of the deterministic and stochas- 
tic version of this model, see Jacquez & Simon [28]). If S 
denotes the number of susceptible individuals and I denotes 
the number of infected individuals, the rates of change of 
S and I are as follows: 
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Here, /3 is the per-contact probability of infection, c the contact 
rate, \x represents the natural mortality rate, y denotes the 
excess mortality caused by infection, and A is the rate of immi- 
gration/birth of new susceptibles. The dynamical behaviour of 
the model depends on the value of the basic reproductive 
number Ro = /3c/(/x + y). If R 0 > 1 in this model, the 
number of infected individuals initially increases exponen- 
tially, plateaus, and finally reaches an equilibrium (figure 2a). 

(a) The number of lineages as a function of time 

For the model (3.1) -(3.2), the phylogenetic structure can be 
captured by the number of lineages as a function of time 
(NLFT), denoted A(s), where S is time going backwards 
from the present to the past. A differential equation describing 
the dynamics of A can be derived by first recognizing that the 
NLFT decreases as a consequence of transmission, but only if 
both lineages involved in the transmission are sampled. 

Let A(s) denote the set of lineages that are ancestral to the 
sample at time s, so that A(s) = \A(s)\.U(s) will denote the set 
of lineages which are not ancestral to the sample and will 
have cardinality U(s). Lower-case symbols will denote 
elements of these sets: a E A and u E ZY. 0 will denote the 
removal of a lineage. At each internal node of the tree, we 
denote the types of daughter lineages i and / and the state 
of the parent k using the notation (/,;) — > k. The possible 
types of transition as we go backwards in time are as follows: 
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Figure 2. (a) Dynamics of the number of infected individuals, / (black line), and v = It/2 (dashed line) over time in weeks based on equations (3.1) -(3.2), as well 
as estimates of 'scaled effective population size' obtained from applying a Bayesian skyride (grey) to simulated data generated from a forwards-time stochastic 
version of the model, with 100 replicates, (b) Dynamics of the mean generation time, r. Parameter values and initial conditions are as follows: /3 = 0.01, c = 1, 
P = 3640' y = 520' ^ = ^m' ^) = 9999, 1(0) = 1, with a simulation time of 40 years. Simulations of the differential equations were performed using the 
simecol library [29] in R [30], fitting of the skyline plot used the INLA library [31], while the stochastic simulations were performed using SimPy v. 1.9.1 in Python (see 
[3] for more details). Simulations were conditioned on reaching a quasi-equilibrium state, and registered by aligning the peaks of the simulated number of infected 
individuals to the peak of infected individuals from the ordinary differential equations. Code to perform simulations is available from http://code.google.eom/p/ 
simonfrost. 



The above transitions assume that the population sizes for 
I, A and U are large, such that 1(1 — 1) « I 2 , etc., and we con- 
sider sampling with replacement for the coalescence of 
lineages. The first type of transition, (a, a) reflects the 
decrease in lineages when there is an infection involving two 
sampled lineages. Infections where neither the source nor the 
recipient individual are sampled do not affect the number 
of lineages in the sample, and so we do not need to consider 
transitions of the form (w, w) — > u for the NLFT. The transition 
(a,u) — > a occurs either when a sampled individual infects 
another individual but we do not sample the latter individual, 
or when an unsampled individual infects a sampled individ- 
ual. These transitions reflect what we have described as an 
'invisible' transmission [32,33]. While such transitions do not 
affect the number of lineages, for structured populations they 
may result in a change in state of the lineage, and so are impor- 
tant for more complex models, which we will demonstrate 
later. The removal of lineages, for example by death or recov- 
ery of infected individuals, while changing the number of 
unsampled lineages (0 — > u), does not directly affect the 
number of sampled lineages, as the transitions 0 — > a is not 
possible; in addition, the transition (a, u) — > u is not possible. 
These transitions suggest the following set of differential 
equations for the dynamics of A and U: 

dA(s)_ A A 

-^—-kjj (3-4) 
and 

dU(s) UU Jf UA 

~dT" " /si TT " 2 ^Jl +SmL (3 ' 5) 

Transitions in this model are more complex than those con- 
sidered in a simple Wright -Fisher model. Firstly, an infection 
gives rise to another through transmission, such that there are 



overlapping 'generations'. Secondly, sampling effects enter 
into the transition rates. Nevertheless, there is a one-to-one cor- 
respondence of the coalescence rate in this epidemiological 
model with that in standard population genetics models 
widely used in viral phylodynamic studies. In a haploid 
Wright -Fisher model, the dynamics of the effective popu- 
lation size N e is in units of generations. The resulting 
estimates from models fitted to phylogenies where branch 
lengths are in real time are usually interpreted as N e T, or the 
'scaled effective population size', where N e is the 'effective 
number of infections' and r the generation time. For the 
model given by equations (3.1) -(3.2), the coalescence rate is 
the same as a haploid Wright -Fisher model if we define the 
number of infected individuals 1/2 as the 'effective number 
of infections' and the generation time t = f$i/I = /3cS/N (the 
incidence-to-prevalence ratio [34]), with the 'scaled effective 
population size' being It/2. The use of the term 'generation 
time' here, in a population genetics context, should not be con- 
fused with epidemiological interpretations of the generation 
time [35,36], which is defined at the individual level, as 
the time between the infection time of an infected person, and 
the infection time of his or her infector, rather than the average 
time between infections at a given time at the population level. 

Figure 2a illustrates the dynamics of v = It/ 2 over time in 
relation to the number of infected individuals I for the model 
given by equations (3.1)-(3.2). This demonstrates that v is out 
of phase with I, and also exhibits differences in the magni- 
tude of fluctuations. We also fitted a 'Bayesian skyride' 
model [37] to simulated coalescent intervals generated from 
a forwards-time, stochastic, discrete event version of the 
model, using a fast approximation that is fitted directly to a 
phylogenetic tree [38]. Such non-parametric models for v 
tend to smooth out fluctuations, as well as underestimating 
v at a given timepoint, as they average v over a time period 



as the harmonic mean [39]. Nevertheless, the skyride 
performs well in identifying the overall trajectory of v. 

These results argue that the term 'effective number of 
infections' is potentially misleading [3], as it implies that 
the ancestral size function v is directly proportional to the 
number of infected individuals. This is not generally the 
case, owing to a time-varying generation time r(t) over 
the course of an epidemic (figure 2b), which is short during 
the early stages of an epidemic, and becomes longer as the 
number of susceptible individuals becomes limiting. How- 
ever, there are time periods, such as the case of exponential 
growth, and at endemic equilibrium, where the generation 
time is constant, and hence v is proportional to the number 
of infected individuals [3,40]. 



(b) The number of leaves and cherries 

While the distribution of coalescent intervals is sufficient for 
inference of v under simple models, this is not the case for 
more realistic models that incorporate heterogeneity. As a 
prelude to discussing these models, we consider the 
number of cherries in the homogeneous model. As cherries 
are generated when two tips coalesce, we consider the 
dynamics of tips and internal branches separately. C(s) and 
B(s) will denote the set of tips and internal branches, with 
cardinality L(s) and £>(s), respectively. As before, lower-case 
symbols will denote elements of these sets. We denote the 
cumulative number of cherries as C, and consider the 
following transitions backwards in time: 
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The rationale for this scheme is as follows. When two 
leaves coalesce, they form a single branch, as well as a 
cherry. When a leaf and a branch coalesce, this either results 
in the loss of a leaf, or a loss of a branch and a change of state 
from a leaf to a branch; both of these occur at the same rate, 
and result in the same net changes in L and B (hence the 
factor of two). When two branches coalesce, this results in 
the loss of a branch. Consideration of the dynamics of tips 
also allows us to consider the proportion of lineages that clus- 
ter with at least one other sequence, a common approach 
when analysing HIV phylogenies [5], and is related to the 
concept of an operational taxonomic unit. The proportion of 
unclustered tips is P(s) = L(s)/A(0), and the distribution of 
tip lengths is — dP(s) / ds. The mean number of taxa per cluster 
[9], M, is included for completeness. If A(0) sequences are 
sampled at a single timepoint s = 0, then the initial conditions 
are L(0) = A(0), B(0) = 0, C(0) = 0 and M(0) = 1. This leads to 
the following set of differential equations for L, B, C and M: 
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The total number of cherries in a tree is simply the sol- 
ution of C(s) at the time to the most recent common 
ancestor (TMRCA). The only subtlety that arises is the calcu- 
lation of the TMRCA. In a standard coalescent framework, 
the TMRCA is the time at which the last two lineages 
coalesce; in an epidemiological model, this is the time at 
which the first transmission takes place involving two 
infected individuals ancestral to the sample, which may 
occur after the first transmission by the first infected individ- 
ual in the population. We make the approximation that the 
TMRCA is the time at which A = L + B = 1. 

Theory based on extended Polya urn models [15] has 
shown that the expected number of cherries in a tree gener- 
ated by a Yule or coalescent process is n/3, where n is the 
number of sequences. We considered the dynamics of cher- 
ries for the simple HIV model at endemic equilibrium. If 
we define a constant k=/ S i/(I 2 ), then the solution of 
equations (3.4) -(3.9) for A(s), L(s) and C(s) is as follows: 
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The time to the most recent common ancestor, 
%irca = CA(0) - 1)/(kA(0)), is the solution of A(s) = 1 for S, 
and the total number of cherries, C(smrca) = (A(0)/3) 
(1-1/A 3 ); for large n, C(s MRCA ) ~ A(0)/3, i.e. the total 
number of cherries in the differential equation model is approxi- 
mately the same as the mean from a Yule or coalescent process, 
with only a negligible difference for sample sizes typical of 
many viral studies, in the order of a hundred or more. 

(c) Sacking index 

As Sackin's index is the number of internal nodes (including 
the root) from each tip, summed over all tips, to obtain an 
approximation for Sackin's index we need to consider the 
coalescence rate, / SI (A/J) 2 , and the expected change in 
Sackin's index given a coalescence, which is 
2A(0)/A(s) = 2M, where M is the mean cluster size; the 
factor of two arises due to coalescent events affecting the 
counts for two lineages. A differential equation for K(s), the 
cumulative value of Sackin's index at time S in the past 
(where K(0) = 0), is as follows: 

~ MINI 

(3.14) 
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K(s MRCA ) provides an approximation for Sackin's statistic. 
Considering the simple HIV model at equilibrium, substitut- 
ing s MRC A = (-A(O) - l)/(#cA(0)) into equation (3.14) 
gives K(s MRCA ) = 2nlog(n). Although as the number of 
sequences tends to infinity, -K(smrca) — > E(Is(n)), the 




Figure 3. Dynamics of (0) the number of lineages, A, (b) the distribution of tip lengths, (c) the mean cluster size, M, (d) the fraction of sequences clustered, 1 - P, 
(e) the number of cherries, C, and (f) Sackin's index, K, for the simple model of HIV infection given by equations (3.1) -(3.2). Parameter values, initial conditions, 
and simulations are as in figure 2. 



difference between -K(smrca) and E(Js(n)) (on the order of 5% 
for sample sizes in the hundreds) is large enough 
that we standardize K(s MRC a) by 2n log(ft) rather 
than 2(iA (0) (^ + l) + 7e-l), i.e. K{s MRCA ) = {K{s MRCA )- 
2n log(n))/2n log(n). Figure 3 demonstrates the dynamics of 
these statistics for the model given by equations (3.1)-(3.2), 
which show excellent correspondence with results obtained 
with forwards-time stochastic simulations. 



F(t), comprising elements fij(t), the rate at which a lineage 
of type i generates another of type /, and a matrix G(t ), com- 
prising elements gy(t), the rate at which a lineage of type i 
changes to one of type j. These matrices are used to express 
the transition rates for changes in the number of ancestral 
lineages of different types [32], which for a two-type system 
are as follows: 



4. Heterogeneity and tree shape 

The model given by equations (3.1)-(3.2) considers only a 
single type of susceptible and a single type of infected individ- 
ual. More generally, we can consider models that include 
heterogeneity between individuals. Examples of such hetero- 
geneity include differences in infectivity at different times 
since infection, differences between hosts in contact rates, and 
geographical heterogeneity. Such heterogeneity can have a pro- 
found effect on the transmission dynamics. Incorporating 
heterogeneity in our phylodynamic models presents additional 
challenges, as we need to consider ancestral lineages for each 
type of infected individual, and coalescences between lineages 
of both the same and different types. 

(a) The number of lineages as a function of time 

We begin by considering the dynamics of the total number of 
lineages of each type for a two-type system, although these 
results can easily be extended to more than two types. Con- 
sidering forwards time, we define a time-varying matrix 
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This leads to the following differential equation 
for the dynamics of Ai(s), with an analogous equation 



for dA 2 (s)/ds: 
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of unclustered lineages that are in state ; at the tips of the 
tree, and are in state i at some time s in the past, P x y(s), can 
be obtained from the above in a similar fashion as in the 
single-population model. 

In this two-type system, there are three types of cherries, 
which we denote by Cy. The rates of coalescence of different 
types of tip (/ n , l 12 , l 2 \ and I22), and the types of cherry 
generated are as follows: 



(b) The number of tips and cherries 

Derivation of the dynamics of tips in this two-type system 
requires us to consider four types of tips, l 21/ l\ 2 and / 22 , 
based on the (unobserved) state at time S in the past (referred 
to by the first subscript) and the (observed) initial state at the 
time of sampling. As for simplicity, we consider sampling at 
a single timepoint, this is at s = 0. For example, transitions for 
the dynamics of tips involving / n are as follows: 
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Consideration of these transitions leads to the following 
differential equation for the dynamics of tips L n , with 
analogous expressions for the dynamics of the other tips: 
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The first line of equation (4.2) represents coalescence of 
tips, the second Invisible' transmissions, which result in a 
change in state and the third migration events. The fraction 
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It is important to note that we have to consider both 
lineages as potential 'sources' of infection when consider- 
ing coalescence between tips of different types, hence the 
factor of two for (InJn) — > h\ and (J21J22) —> ^2- Consider- 
ation of these transitions gives rise to the following 
differential equations for the dynamics of the number of 
cherries: 
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and 

^ = /^)%</, 2+ «^+/,,(^. («) 

The total number of cherries, C = Cn + C\ 2 + C 22 gives a 
measure of tree asymmetry, which can be compared against 
the null of « A(0)/3. To facilitate comparison of trees with 
different numbers of tips, L(0) = A(0), we define a 
normalized number of cherries, C norm = C/A(0). 



This equation simply captures flow between the states, either 
by coalescent events (captured by the matrix F) or by 
'migration' between states (captured by the matrix G). To 
verify the deterministic approximations, and to determine 
the variability in tree shape due to finite sample size, we 
also simulated trees using an approximation to the coalescent 
in structured populations developed by Volz [32], which 
takes the matrices F and G and the numbers of infected 
individuals, I\ and I 2/ at different time points as input. 



(i) The composition of cherries as a measure of clustering 

Capturing how different types cluster together on a tree, i.e. 
co-clustering, is difficult, as — except at the tips of the tree — 
the type of a lineage is not directly observable. Previously, 
we have derived equations for the correlations in numbers 
of sequences of different types in a cluster [33]. Here, we con- 
sider clustering in terms of the composition of different types 
of cherries, with relatively low values of C 12 being indicative 
of separation between types. We define the following 
measure of assortativity, based on that of Newman [41]. We 
denote a matrix E with elements as follows: 
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The assortativity coefficient, r, is defined as follows, where 

a i = E; e ij and = Ei e f 
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Under a null panmictic model, r = 0, while for a model where 
types are completely separated, r= 1. An estimate of r can 
also be obtained directly from a viral phylogeny. 

(c) Sackin's index 

Extending our approximation for Sackin's index (equation 
(3.14)) to two subpopulations is relatively straightforward, 
except now we have to consider three different types of coalesc- 
ence. When lines of type i and ; coalesce, they produce a clade 
with a mean number of descendants X;(s)/A;(s)+ X ; (s)/A ; (s), 
where X z (s) denotes the number of taxa descended from all 
extant lineages of type i at time s in the past, with 
Xi(s) = A(0). Such clades are produced at the rate 
Fij(s)(Ai(s) /Ii(s))(Aj(s) /Ij(s)). This leads to the following differ- 
ential equation for the cumulative Sackin index until time s: 
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In order to aid comparison with the simple model with- 
out heterogeneity, we derive a normalized version of k, 
K= (K-2nlog(n))/(2nlog(n)). 

The dynamics of X^s) can be described with the follow- 
ing equation, with an analogous equation for X 2 (s), with 
initial conditions Xi(0) = Ai(0),X 2 (0) = A 2 (0): 
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5. Applications 



To determine how structure and sampling affects phylody- 
namic patterns, in terms of the number of lineages over 
time, the extent to which sequences cluster and co-cluster, 
and the extent of tree asymmetry, we now consider two 
specific models of HIV that incorporate heterogeneity, 
either in infectiousness over the course of infection or 
differences between groups in contact rates. 

(a) Acute and chronic HIV infection 

The infectiousness of HIV-1 is thought to be much higher 
during acute infection than during chronic infection [42]. Pre- 
viously, we have analysed models of HIV transmission that 
include acute and chronic infection [9,32,33]. We recap 
some of the main results here, as well as extending them to 
consider more tree statistics. We denote the number of acutely 
infected individuals by I lf and the number of chronically 
infected individuals by I 2 . We allow acutely infected individ- 
uals to have a different per-act probability of infecting a 
susceptible person, /3 lf which we assume to be higher than 
the probability for a chronically infected person, i.e. 
fi x > p 2 . We assume that acute infection progresses to chronic 
infection at rate a, and that acutely infected individuals do 
not suffer any excess mortality due to HIV infection. These 
generalizations to the simple HIV model result in the 
following set of differential equations: 
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and 
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N(t) = S(t)+h(t)+I 2 (t). 



(5.4) 



The matrices F and G for this model are as follows: 
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(b) A model with risk structure 

To investigate the effects of heterogeneity in contact rates 
between individuals, we considered a model with two 



groups of individuals with different contact rates, c ir with the 
fraction of contacts made by a person in group i with a person 
in group ; denoted by p^ [43]. 
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where 

Ni(t) = Si(t) +Ii(t). 



(5.11) 



A number of assumptions can be made regarding the 
fraction of contacts of a person in group i with a person in 
group j, pij. A common assumption is proportionate mixing, 
in which the fraction of the contacts of group i with group j 
is equal to the fraction of the total contacts made by the popu- 
lation that are due to group /, such that p^ = cjNj/ c^N^. 
A more general formulation, that allows a wider range of 
mixing matrices, is the preferred mixing structure described 
by Jacquez et al. [43], in which a fraction p z of the contacts 
of group i are reserved for within-group contacts. The 
elements pu and p^ (i ^ j) under this model are as follows 
(note that this corrects an error in the term for p^ reported 
in Jacquez et al. [43]): 
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If pi = 0 for all i, then the contact matrix simplifies to pro- 
portionate mixing. The matrices F and G for this model are as 
follows: 
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For both the acute /chronic model, and the differential 
risk model, the differential equations captured the mean 
number of cherries, the assortativity coefficient and Sackins 
index calculated from simulations of multiple trees, at a 
fraction of the computational burden (see the electronic 
supplementary material, figures S4-S6). 



(c) Tree shape and structure 

We simulated the acute /chronic model and the differential 
risk model, assuming either proportionate or preferential 
mixing, for a range of sample fractions, (/>, from 0.1 to 0.9. 
Model outputs for the number of cherries, the assortativity 



coefficient and Sacking index at a fixed time of sampling 
are shown in figure 4. 

For the acute /chronic model, we considered a range of 
values for the relative infectiousness of acute and chronic HIV 
infection, while maintaining the same mean infectiousness 
over the infection period. Assortativity increased with higher 
infectiousness of acute infection, in line with our previous 
results examining the composition of clusters [33]. Although 
higher infectiousness resulted in more asymmetric trees, this 
depended on both the sample fraction and the choice of statistic 
in a nonlinear way. Sackin's index showed the greatest evidence 
of asymmetry for intermediate values for the relative infectious- 
ness of acute infection and was generally insensitive to 
sampling fraction. In contrast, when the sample fraction was 
high, asymmetry, as measured by a low number of cherries, 
was the greatest for high infectiousness during acute infection. 

For proportionate mixing, the differential risk models over a 
range of contact rates for the high-risk population relative to the 
low-risk population demonstrated asymmetry similar in magni- 
tude to those of the acute/ chronic model, in terms of the number 
of cherries, but showed less extreme values for Sackin's index. 
Assortativity was generally low, with small negative values of 
the assortativity coefficient for greater relative contact rates for 
the high-risk group. However, although variation in contact 
rates could result in asymmetric phylogenies under proportion- 
ate mixing, this effect was almost completely eliminated when 
mixing was preferential (p = 0.9), as such population subdivi- 
sion limits the impact that individuals with a high contact rate 
can have on the entire viral phylogeny. Also in contrast to pro- 
portionate mixing, assortativity was much more marked when 
mixing was preferential. The assortativity coefficient, r, was rela- 
tively insensitive to contact rate variation, being mainly driven 
by the mixing between the high- and low-risk groups, captured 
by the parameter p (results not shown), although the assortativ- 
ity of different types of infected individuals, r, may be much less 
than the assortativity of different types of all individuals, p, 
especially for low sample fractions. 



6. Discussion 

We have extended our previous differential equation-based 
framework for modelling the NLFT to consider tree asymme- 
try and, in the case of structured population models, 
co-clustering of different states. Of note is that our models 
generate trajectories of measures of asymmetry and assorta- 
tivity over evolutionary time, rather than just summary 
measures over the whole tree. We have also presented 
examples of how heterogeneity in the susceptible and /or 
infected individuals can result in different phylodynamic pat- 
terns. The two models presented here have a wide range of 
applications. For example, the model used for acute and 
chronic HIV infection can also be used to consider a simple 
form of treatment, where I\ and I 2 represent the number of 
untreated and treated individuals, respectively, and a rep- 
resents the rate of going on treatment, while the model of 
different risk groups can be used to examine heterosexual 
spread of HIV, by setting c#, i= 1,2 to zero, or a spatial 
model, where 'migration' of infections occurs via trans- 
mission between individuals in different geographical areas. 

Given the nonlinearities in the system, it is hard to 
develop scenarios where the impact of a single parameter 
can be examined. For example, changing the relative 
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Figure 4. Asymmetry and clustering assuming a range of sampling fractions, from 0 = 0.1 (red) to $ = 0.9 (violet) in steps of 0.1, for different values of the 
relative infectiousness of acute infection (a), and the relative contact rate in the differential risk model, assuming either (b) proportionate mixing (p = 0) or 
(c) preferential mixing (p = 0.9). Parameter values for the acute/chronic model are as follows: c = 1, a = * y = ^ /x : 
/t(0) = 1, / 2 (0) = 0. The infectivity parameters fa were constrained such that fa = /3 (d-\ + d 2 )/(kd<\ + d 2 ) and fa = kfa, where /3 is the mean 
infectiousness (with /3 = 0.01), d-, the mean duration of stage / and k the fold increase in infectiousness during acute infection. Parameter values for the 
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infectiousness of acute infection changes the whole trajectory 
of the epidemic, such that sampling at a fixed time since the 
introduction is not strictly comparable across parameter 
values. As our models comprise a relatively small number 
of differential equations, which can be simulated quickly, 
they are well suited for exploring how tree shape is affected 
by population structure. In contrast, simulating trees may 
be extremely time consuming, especially for large numbers 
of taxa, and large numbers of simulations for a given par- 
ameter set may be needed, owing to the high variability in 
many tree shape statistics. 

Our results suggest that in many cases, the level of asym- 
metry of the tree may be rather insensitive to the underlying 
population structure. This is not particularly surprising for a 
number of reasons, including the relatively weak selection of 
HIV-1 at the population level [2], the averaging of asymmetry 
over the entire tree, and that the risk among those infected is 
likely to be higher and less variable among infected individ- 
uals than susceptible individuals. Given these considerations, 



it is somewhat surprising that Leventhal et al. [13] found 
asymmetry in their analysis of the Swiss HIV epidemic, 
especially as the overall phylogeny comprised distinct risk 
groups, which our results suggest generates less, not more 
asymmetry. This may be due to biases in when each risk 
group was sampled, and /or the unusually high sampling 
fraction in this epidemic (30-40%). Indeed, the three largest 
transmission clusters, which were more homogeneous in 
terms of risk (one associated with heterosexual risk /injection 
drug use and two clusters associated with men who have sex 
with men) showed much lower asymmetry (J s <0.5). Our 
models also show that factors other than contact rate, such 
as high infectiousness during acute infection, may have a 
more dramatic impact on asymmetry; while high-risk 
groups may be at a minority in a population, all infected indi- 
viduals go through a period of increased infectiousness 
during acute infection. Moreover, as sequences sampled at 
different times will generate more asymmetric trees for 
rapidly evolving pathogens such as HIV-1 (see the electronic 



supplementary material, figure S2), measures of asymmetry 
may be difficult to interpret for serial samples, which are 
commonplace in HIV-1 phylogenetic studies, and difficult 
to compare between studies that have different temporal 
sampling patterns. 

The composition of cherries may be highly informative 
about patterns of mixing between populations, provided that 
the sample size is sufficient to include representatives from 
all groups. However, in order to calculate the composition of 
cherries, we need to specify subpopulations a priori, and this 
may be difficult to perform, especially for variables such as 
sexual contact rates. Although, ideally, other data such as be- 
havioural data should be collected in order to identify risk 
groups, as clustering is also related to contact rates, it may be 
possible to identify individuals with higher contact rates 
based on patterns of clustering. However, patterns of clustering 
have to be interpreted carefully, as differences in clustering may 
also be driven by differences in the time since infection at 
which samples are taken ([33]; electronic supplementary 
material, figure S7) and by the underlying frequencies of the 
groups (see the electronic supplementary material, figure S8). 

Our simulations assumed a random sample of taxa across 
all groups. In practice, random sampling of infected individ- 
uals may not be feasible, or in some cases it may even be 
desirable to oversample particular groups. For example, 
while our model of acute and chronic HIV infection predicts 
increasing assortativity as the assumed relative infectiousness 
during acute infection increases, it may be difficult to test this 
empirically, as generally acutely infected individuals are rela- 
tively infrequent, and sampling variation in the assortativity 
coefficient may be high (see the electronic supplementary 
material, figures S4-S6). Our framework can accommodate 



over- or under-sampling of specific groups, although prior 
information on the size of each group is highly desirable in 
order to make accurate inferences. 

We have focused on developing and simulating phylody- 
namic models, rather than inferring parameter values of these 
models from sequence data. As highlighted in our discussion 
of the simple HIV model, some simple epidemiological 
models are just special cases of the time-varying coalescent 
model, for which methods of inference are well established. 
While the theory presented for structured models can also be 
used as a basis for inference, full likelihood-based fitting may 
be computationally intensive, and approximations to the likeli- 
hood may be required [32]. The models presented here, which 
can generate a number of summary measures of phylogenetic 
structure, can be used as the basis for Approximate Bayesian 
Computation (ABC) approaches [44], in which parameter 
values are found that generate simulated data that resemble 
the observed data. The use of more biologically realistic phylo- 
dynamic models can be used not only to determine whether a 
population deviates from random mixing [13], but also to 
determine the type of population structure. By linking asym- 
metry, assortativity and the number of lineages through time, 
bespoke models of viral phylodynamics may be able to provide 
rich insights into the dynamics of viral transmission. 
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